24 research outputs found
Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access
Scientific applications often contain large computationally-intensive
parallel loops. Loop scheduling techniques aim to achieve load balanced
executions of such applications. For distributed-memory systems, existing
dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a
master-worker execution model to assign variably-sized chunks of loop
iterations. The master-worker execution model may adversely impact performance
due to the master-level contention. This work proposes a distributed
chunk-calculation approach that does not require the master-worker execution
scheme. Moreover, it considers the novel features in the latest MPI standards,
such as passive-target remote memory access, shared-memory window creation, and
atomic read-modify-write operations. To evaluate the proposed approach, five
well-known DLS techniques, two applications, and two heterogeneous hardware
setups have been considered. The DLS techniques implemented using the proposed
approach outperformed their counterparts implemented using the traditional
master-worker execution model.
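The idea of distributed chunk calculation without a master can be illustrated with a minimal sketch. It is not the authors' implementation: Python threads stand in for MPI processes, and a lock-protected counter stands in for an atomic read-modify-write on a passive-target RMA window (in MPI this would be a call such as `MPI_Fetch_and_op`). The names `SharedLoopState`, `worker`, and `run` are hypothetical, and a fixed chunk size is assumed for simplicity.

```python
import threading


class SharedLoopState:
    """Stand-in for an RMA window holding the next unscheduled iteration index."""

    def __init__(self, n):
        self.n = n
        self.next_start = 0
        self._lock = threading.Lock()  # emulates the atomicity of the RMA operation

    def fetch_and_add(self, size):
        """Atomically return the current start index and advance it by `size`."""
        with self._lock:
            start = self.next_start
            self.next_start += size
        return start


def worker(state, chunk, done, rank):
    """Each worker claims its own chunks; no master is asked for work."""
    while True:
        start = state.fetch_and_add(chunk)
        if start >= state.n:
            break  # all iterations have been claimed
        end = min(start + chunk, state.n)
        done[rank].extend(range(start, end))  # "execute" the claimed iterations


def run(n=1000, workers=4, chunk=16):
    """Schedule n iterations over `workers` threads and return who ran what."""
    state = SharedLoopState(n)
    done = [[] for _ in range(workers)]
    threads = [
        threading.Thread(target=worker, args=(state, chunk, done, rank))
        for rank in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because every worker obtains its start index in a single atomic step, each iteration is claimed exactly once, and no process has to serve requests on behalf of the others.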
Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach
Computationally-intensive loops are the primary source of parallelism in
scientific applications. Such loops are often irregular and a balanced
execution of their loop iterations is critical for achieving high performance.
However, several factors may lead to an imbalanced execution, such as problem
characteristics and algorithmic and systemic variations. Dynamic loop
self-scheduling (DLS) techniques are devised to mitigate these factors, and
consequently, improve application performance. On distributed-memory systems,
DLS techniques can be implemented using a hierarchical master-worker execution
model and are, therefore, called hierarchical DLS techniques. These techniques
self-schedule loop iterations at two levels of hardware parallelism: across and
within compute nodes. Hybrid programming approaches that combine the message
passing interface (MPI) with open multi-processing (OpenMP) dominate the
implementation of hierarchical DLS techniques. The MPI-3 standard includes the
feature of sharing memory regions among MPI processes. This feature introduced
the MPI+MPI approach that simplifies the implementation of parallel scientific
applications. The present work designs and implements hierarchical DLS
techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques
are considered in the evaluation proposed herein. The results indicate certain
performance advantages of the proposed approach compared to the hybrid
MPI+OpenMP approach.
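The two-level structure described above can be sketched as follows. This is an illustrative simulation, not the authors' MPI+MPI code: it assumes guided self-scheduling (GSS) at both levels and assigns node-level chunks round-robin, whereas in a real run the order depends on which node requests work first. The function names `gss_chunks` and `hierarchical_gss` are hypothetical.

```python
from math import ceil


def gss_chunks(start, count, workers):
    """Split [start, start + count) into GSS chunks of ceil(remaining / workers)."""
    chunks = []
    remaining = count
    while remaining > 0:
        size = ceil(remaining / workers)
        chunks.append((start, size))
        start += size
        remaining -= size
    return chunks


def hierarchical_gss(n, nodes, procs_per_node):
    """Return (node, start, size) sub-chunks for a two-level schedule.

    Level 1: the loop is split into large chunks across compute nodes.
    Level 2: each node-level chunk is split again among the node's processes.
    """
    plan = []
    for i, (node_start, node_size) in enumerate(gss_chunks(0, n, nodes)):
        node = i % nodes  # assumption: round-robin stands in for request order
        for local_start, local_size in gss_chunks(
            node_start, node_size, procs_per_node
        ):
            plan.append((node, local_start, local_size))
    return plan
```

With the MPI+MPI approach, the second level would be served from a memory region shared among the processes of one node rather than by OpenMP threads; the sketch only shows how the iteration space is partitioned.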
Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling
High performance computing (HPC) systems have undergone a significant increase in
their processing capabilities. Modern HPC systems combine large numbers of
homogeneous and heterogeneous computing resources. Scalability is, therefore,
an essential aspect of scientific applications to efficiently exploit the
massive parallelism of modern HPC systems. This work introduces an efficient
version of the parallel spin-image algorithm (PSIA), called EPSIA. The PSIA is
a parallel version of the spin-image algorithm (SIA). The (P)SIA is used in
various domains, such as 3D object recognition, categorization, and 3D face
recognition. EPSIA refers to the extended version of the PSIA that integrates
various well-known dynamic loop scheduling (DLS) techniques. The present work:
(1) Proposes EPSIA, a novel flexible version of PSIA; (2) Showcases the
benefits of applying DLS techniques for optimizing the performance of the PSIA;
(3) Assesses the performance of the proposed EPSIA by conducting several
scalability experiments. The performance results are promising and show that,
using well-known DLS techniques, EPSIA outperforms the PSIA by factors of 1.2
and 2 on homogeneous and heterogeneous computing resources, respectively.
Performance Reproduction and Prediction of Selected Dynamic Loop Scheduling Experiments
Scientific applications are complex, large, and often exhibit irregular and
stochastic behavior. The use of efficient loop scheduling techniques in
computationally-intensive applications is crucial for improving their
performance on high-performance computing (HPC) platforms. A number of dynamic
loop scheduling (DLS) techniques have been proposed between the late 1980s and
early 2000s, and efficiently used in scientific applications. In most cases,
the computing systems on which they have been tested and validated are no
longer available. This work is concerned with the minimization of the sources
of uncertainty in the implementation of DLS techniques to avoid unnecessary
influences on the performance of scientific applications. Therefore, it is
important to ensure that the DLS techniques employed in scientific applications
today adhere to their original design goals and specifications. The goal of
this work is to attain and increase the trust in the implementation of DLS
techniques in present studies. To achieve this goal, the performance of a
selection of scheduling experiments from the 1992 original work that introduced
factoring is reproduced and predicted via both simulative and native
experimentation. The experiments show that the simulation reproduces the
performance achieved on the past computing platform and accurately predicts the
performance achieved on the present computing platform. The performance
reproduction and prediction confirm that the present implementation of the DLS
techniques considered, both in simulation and natively, adheres to their
original description. The results confirm the hypothesis that reproducing
experiments of identical scheduling scenarios on past and modern hardware leads
to behavior entirely different from what is expected.
A Methodology for Bridging the Native and Simulated Execution of Parallel Applications
Simulation is considered the third pillar of science, following experimentation and theory. Bridging the native and simulated executions of parallel applications is needed for attaining trustworthiness in simulation results. Yet, bridging the native and simulated executions of parallel applications is challenging. This work proposes a methodology for bridging the native and simulated executions of message passing parallel applications on high performance computing (HPC) systems in two steps: expression of the software characteristics, and representation and verification of the hardware characteristics in the simulation. This work exploits the capabilities of the SimGrid [3] simulation toolkit’s interfaces to reduce the effort of bridging the native and simulated executions of a parallel application on an HPC system. For an application from computer vision, the simulation of its parallel execution using straightforward parallelization on an HPC cluster approaches the native performance with a minimum relative percentage difference of 5.6%.
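As a small worked example of the metric mentioned above, the sketch below computes a relative percentage difference between a native and a simulated execution time. The abstract does not state the exact formula, so the common definition relative to the native time is an assumption here, and the function name is hypothetical.

```python
def relative_percentage_difference(native_time, simulated_time):
    """Return |simulated - native| / native * 100 (assumed definition)."""
    if native_time <= 0:
        raise ValueError("native_time must be positive")
    return abs(simulated_time - native_time) / native_time * 100.0
```

Under this assumed definition, a simulated time of 105.6 s against a native time of 100 s (hypothetical values, not taken from the work) gives a difference of about 5.6%.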
Exploring the Relation between Two Levels of Scheduling Using a Novel Simulation Approach
Modern high performance computing (HPC) systems exhibit a rapid growth in size, both “horizontally” in the number of nodes, as well as “vertically” in the number of cores per node. As such, they offer additional levels of hardware parallelism. Each level requires and employs algorithms for appropriately scheduling the computational work at the respective level. The present work explores the relation between two scheduling levels: batch and application. To understand and explore this relation, a novel simulation approach is presented that bridges two existing simulators from the two scheduling levels. A novel two-level simulator that implements the proposed approach is introduced. The two-level simulator is used to simulate all combinations of three batch scheduling and four application scheduling algorithms from the literature. These combinations are considered for allocating resources and executing the parallel jobs from a workload of a production HPC system. The results of the scheduling experiments reveal the strong relation between decisions taken at the two scheduling levels and their mutual influence. Complementing the simulations, the two-level simulator produces abstract parallel execution traces, which can visually be examined and illustrate the execution of different jobs and, for each job, the execution of its tasks at node and core levels, respectively.
Simulating Batch and Application Level Scheduling Using GridSim and SimGrid
Modern high performance computing (HPC) systems are increasing in the complexity of their design and in the levels of parallelism they offer. Studying and enhancing scheduling in HPC has become of great interest for two main reasons. First, scheduling decisions are taken by different types of schedulers such as batch, application, process, and thread schedulers. Second, simulation has become an important tool to examine the design of HPC systems. Therefore, in this work, we study the simulation of different scheduling levels. We used two well-known simulation toolkits, SimGrid and GridSim, in order to support two different scheduling levels, batch and application level scheduling. Each toolkit is extended to support both levels. Moreover, three different scheduling algorithms for each level are implemented and their performance is examined through a real workload dataset. Finally, a comparison of the extension challenges of the two simulators is conducted.
Dynamic Loop Scheduling Using the MPI Passive-Target Remote Memory Access Model
Large parallel loops are present in many scientific applications. Static and dynamic loop scheduling (DLS) techniques aim to achieve load balanced executions of applications. The use of DLS techniques in scientific applications, such as the self-scheduling-based techniques, showed significant performance advantages compared to static techniques. On distributed-memory systems, DLS techniques have been implemented using the message-passing interface (MPI). Existing implementations of MPI-based DLS libraries do not consider the novel features of the latest MPI standards, such as one-sided communication, shared-memory window creation, and atomic read-modify-write operations. This poster considers these features and proposes an MPI-based DLS library written in the C language. Unlike existing libraries, the proposed DLS library does not employ a master-worker execution model. Moreover, it contains implementations of five well-known DLS techniques, namely self-scheduling, fixed-size chunking, guided self-scheduling, trapezoid self-scheduling, and factoring. An application from computer vision is used to assess and compare the performance of the proposed library against the performance of existing solutions. The evaluation results show improved performance and highlight the need to revise and upgrade existing solutions in light of the significant advancements in the MPI standards.
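The five techniques named above differ only in how the size of the next chunk is calculated. The sketch below uses the commonly cited textbook formulas and is not taken from the proposed library: the function name `chunk_sizes` is hypothetical, the fixed-size chunking value is passed in as a parameter rather than derived from overhead and variance, and factoring is shown in its practical variant where each batch hands out half of the remaining iterations.

```python
from math import ceil


def chunk_sizes(technique, n, p, fsc_size=8):
    """Return the sequence of chunk sizes that schedules n iterations on p workers."""
    if technique == "TSS":
        # Trapezoid self-scheduling: sizes decrease linearly from `first` to `last`.
        first = ceil(n / (2 * p))
        last = 1
        steps = ceil(2 * n / (first + last))
        delta = (first - last) // (steps - 1) if steps > 1 else 0
    sizes = []
    remaining = n
    batch_chunk = 0
    while remaining > 0:
        i = len(sizes)  # index of the chunk being calculated
        if technique == "SS":  # self-scheduling: one iteration at a time
            size = 1
        elif technique == "FSC":  # fixed-size chunking: constant size (assumed given)
            size = fsc_size
        elif technique == "GSS":  # guided self-scheduling
            size = ceil(remaining / p)
        elif technique == "TSS":
            size = max(first - i * delta, last)
        elif technique == "FAC":  # factoring: p equal chunks per batch
            if i % p == 0:
                batch_chunk = ceil(remaining / (2 * p))
            size = batch_chunk
        else:
            raise ValueError(f"unknown technique: {technique}")
        size = min(size, remaining)  # never schedule past the end of the loop
        sizes.append(size)
        remaining -= size
    return sizes
```

For example, `chunk_sizes("GSS", 1000, 4)` starts with a chunk of 250 iterations and shrinks from there, while `chunk_sizes("FAC", 1000, 4)` starts with four chunks of 125 each. Larger early chunks reduce scheduling overhead, and smaller late chunks even out the finishing times.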